Abstract


Each individual handles many tasks of finding the most profitable option
from a set of options that stochastically provide rewards. Our society
comprises a collection of such individuals, and the society is expected to
maximise the total rewards, while the individuals compete for common
rewards. Such collective decision making is formulated as the ‘competitive
multi-armed bandit problem (CBP)’, requiring a huge computational cost.
Herein, we demonstrate a prototype of an analog computer that efficiently
solves CBPs by exploiting the physical dynamics of numerous fluids in
coupled cylinders. This device enables the maximisation of the total rewards for
the society without paying the conventionally required computational cost;
this is because the fluids estimate the reward probabilities of the options for
the exploitation of past knowledge and generate random fluctuations for the
exploration of new knowledge. Our results suggest that to optimise the social
rewards, the utilisation of fluid-derived natural fluctuations is more
advantageous than applying artificial external fluctuations. Our analog computing
scheme is expected to trigger further studies for harnessing the huge
computational power of natural phenomena for resolving a wide variety of complex
problems in modern information society.


Introduction 

The benefits to an organization (the whole) and those to its constituent members
(parts) sometimes conflict. For example, consider a situation wherein traffic
congestion is caused by a driver making a selfish decision to pursue his/her
individual benefit of arriving quickly at a destination. When a car
bound from south to north approaches an intersection where preceding vehicles
are stalled while the signal is about to turn red, the driver must refrain from
selfishly deciding to enter the intersection. Otherwise, the car would be stalled
in the intersection after the signal turned red and would obstruct other
vehicles’ paths in the west and east directions. Thus, the whole’s benefit can
be spoiled by that of a part.

The conflict between the whole’s benefit and that of the parts frequently arises
in a wide variety of situations in modern society. Confrontations between
communities and wars between nations can be seen as caused by collisions of global and
local interests. In realistic political judgment, many of these collisions are
modelled using a game-theoretic approach by appropriately setting up a payoff matrix.
In mobile communication, the channel assignment problem in cognitive
radio communication can also be represented as a particular class of payoff matrix.
Herein, we consider the competitive bandit problem (CBP), which is a problem of
maximising total rewards through collective decision making and whose
computational cost becomes huge as the problem size increases. We demonstrate a method for
exploiting the computational power of the physical dynamics of numerous fluids
in coupled cylinders to efficiently solve the problem.

Consider two slot machines. Both machines have individual reward
probabilities P_A and P_B. At each trial, a player selects one of the machines and obtains some
reward, a coin for example, with the corresponding probability. The player wants
to maximise the total reward sum obtained after a particular number of selections.
However, it is assumed that the player does not know these probabilities. How can
the player gain maximal rewards? The multi-armed bandit problem (BP) involves
determining the optimal strategy for selecting the machine which yields maximum
rewards by referring to past experiences.



[Figure 1: panels (a) and (b) show Machine A and Machine B with their reward
probabilities; panel (c) is the payoff matrix below.]

               Player 2: A        Player 2: B
Player 1: A    P_A/2 (P_A/2)      P_A (P_B)
Player 1: B    P_B (P_A)          P_B/2 (P_B/2)

Figure 1: Competitive Bandit Problem (CBP). (a) segregation state, (b) collision
state, (c) Payoff matrix for player 1 (player 2).

For simplicity, we consider here the minimum CBP, i.e. two players (1 and
2) and two machines (A and B), as shown in Fig. 1. It is supposed that a player
playing machine i can obtain some reward, a coin for example, with the probability
P_i. Figure 1(c) shows the payoff matrix for players 1 and 2.

If a collision occurs, i.e. two players select the same machine, the reward is
evenly split between those players. We seek an algorithm that can obtain the
maximum total rewards (scores) of all players. To acquire the maximum total rewards,
the algorithm must contain a mechanism that can avoid the ‘Nash equilibrium’
states, which are the natural consequence for a group of independent selfish
players, and can determine the ‘social maximum’ states, which yield the maximum
total rewards. In our previous studies, we showed that our
proposed algorithm called ‘Tug-Of-War (TOW) dynamics’ is more efficient than other
well-known algorithms such as the modified ε-greedy and softmax algorithms, and
is comparable to the ‘upper confidence bound 1-tuned (UCB1T) algorithm’, which
is known as the best among parameter-free algorithms. Moreover, TOW
dynamics effectively adapts to a changing environment wherein the reward
probabilities dynamically switch. Algorithms for solving CBP are applicable to various
fields such as Monte Carlo tree search, which is used in algorithms for the ‘game
of Go’ [11, 12], cognitive radio [13, 14] and web advertising [15].

Herein, by applying TOW dynamics that exploit the volume conservation law,
we propose a physical device that efficiently computes the optimal machine
assignments of all players in a centralised control. The proposed device consists of two
kinds of fluids in cylinders: one representing ‘decision making by a player’ and
the other representing the ‘interaction between players (collision avoider)’. We
call the physical device the ‘TOW bombe’ owing to its similarity to the ‘Turing
bombe’ invented by Alan Turing, the analog electric circuit used by the British
army for decoding the German army’s ‘Enigma code’ during World War II.
The assignment problem for M players and N machines can be automatically
solved simply by repeating an operation (an up-and-down operation of the fluid
interface in a cylinder) M times at every iteration in the TOW bombe, without
calculating the O(N^M) evaluation values. This suggests that an analog computer is
more advantageous than a digital computer, if we appropriately use natural
phenomena. Although the problems considered here are not really nondeterministic
polynomial-time (NP) problems, we can show the advantages of natural fluctuations
generated in the device and suggest the possibility of extending the device to
NP problems. Using the TOW bombe, we can automatically achieve the social
maximum assignments by entrusting the huge amount of computations for
evaluation values to the physical processes of fluids.

Consider an incompressible fluid in a cylinder, as shown in Fig. 2(a). Here,
X_k corresponds to the displacement of terminal k from an initial position, where
k ∈ {A, B}. If X_k is greater than 0, we consider that the liquid selects machine k.

We used the following estimate Q_k (k ∈ {A, B}):

    Q_k(t) = N_k(t) - (1 + \omega) L_k(t).    (1)

Here, N_k is the number of plays of machine k until time t and L_k is the number of
non-rewarded (i.e. failed) events in k until time t, where ω is a weighting parameter
(see Method).

Figure 2: (a) TOW dynamics, (b) The TOW bombe for three players and five
channels.

The displacement X_A (= -X_B) is determined by the following difference
equation:

    X_A(t) = Q_A(t) - Q_B(t) + \delta(t).    (2)

Here, δ(t) is an arbitrary fluctuation to which the liquid is subjected. Consequently,
the TOW dynamics evolve according to a particularly simple rule: in addition to
the fluctuation, if machine k is played at time t, +1 or -ω is added to X_k(t-1)
when rewarded or non-rewarded, respectively (Fig. 2(a)). The authors have
shown that these simple dynamics gain more rewards (coins or packet transmissions
in cognitive radio) than those obtained by other popular algorithms for solving the
BP.
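The update rule of Eqs. (1) and (2) can be sketched in a few lines of Python. This is only an illustrative simulation, not the authors' implementation: the function name `tow_two_armed`, the uniform distribution chosen for the fluctuation δ(t) and the default amplitude are assumptions made for the example.

```python
import random


def tow_two_armed(p_a, p_b, omega, steps, amplitude=0.5, seed=0):
    """Simulate two-armed TOW dynamics; return (plays, rewards) per machine."""
    rng = random.Random(seed)
    probs = {"A": p_a, "B": p_b}
    q = {"A": 0.0, "B": 0.0}          # estimates Q_k of Eq. (1)
    plays = {"A": 0, "B": 0}
    rewards = {"A": 0, "B": 0}
    for _ in range(steps):
        # Assumed fluctuation delta(t): uniform in [-amplitude, amplitude].
        delta = rng.uniform(-amplitude, amplitude)
        x_a = q["A"] - q["B"] + delta  # Eq. (2); X_B = -X_A
        k = "A" if x_a > 0 else "B"
        plays[k] += 1
        if rng.random() < probs[k]:
            rewards[k] += 1
            q[k] += 1.0                # rewarded: N_k grows by 1
        else:
            q[k] -= omega              # failed: N_k and L_k both grow by 1
    return plays, rewards
```

Because only the difference Q_A - Q_B enters the decision, a failure on machine B raises X_A just as a reward on machine A does, which is the simultaneous update that the volume conservation law provides.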
Many algorithms for the BP estimate the reward probability of each machine.
In most cases, this ‘estimate’ is updated only when the corresponding machine
is selected. In contrast, TOW dynamics uses a unique learning method which is
equivalent to updating both estimates simultaneously owing to the volume
conservation law. TOW dynamics can imitate a system that determines its
next move at time t + 1 by referring to the estimate of each machine, even if it was
not selected at time t, as if the two machines were simultaneously selected at time t.
This unique feature is one of the sources of the TOW’s high performance [16]. We
call this the ‘TOW principle’. This principle is also applicable to a more general
BP (see Method).

1 Results 

The TOW bombe for three players (1, 2 and 3) and five machines (A, B, C, D and E)
is illustrated in Figure 2(b). Two kinds of incompressible fluids (blue and yellow)
fill coupled cylinders. The blue (bottom) fluid handles each player’s decision making,
while the yellow (upper) one handles the interaction among players. The machine selection
of each player at each iteration is determined by the heights of the red adjusters (fluid
interface levels), and the machine with the highest level is chosen. When the movements of the blue
and yellow adjusters stabilise and reach equilibrium, the TOW principle in the blue
fluid holds for each player. In other words, when one interface rises, the other
four interfaces fall, resulting in efficient machine selections. Simultaneously, the
action-reaction law holds for the yellow fluid (i.e. if the interface level of player 1
rises, the interface levels of players 2 and 3 fall), contributing to collision avoidance,
so that the TOW bombe can search for an overall optimisation solution accurately
and quickly. In normal use, however, the blue and yellow adjusters must be fixed in
position so that they do not move.

The dynamics of the TOW bombe are expressed as follows: 


    Q_{(i,k)}(t) = \Delta Q_{(i,k)}(t) + Q_{(i,k)}(t-1) - \frac{1}{M-1} \sum_{j \neq i} \Delta Q_{(j,k)}(t),    (3)

    X_{(i,k)}(t) = Q_{(i,k)}(t) - \frac{1}{N-1} \sum_{l \neq k} Q_{(i,l)}(t).    (4)

Here, X_{(i,k)}(t) is the height of the interface of player i and machine k at iteration
step t. If machine k is chosen for player i at time t, ΔQ_{(i,k)}(t) is +1 or -ω according
to the result (rewarded or not). Otherwise, it is 0.
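A single iteration of these dynamics can be sketched as follows. The sketch assumes the reading of Eqs. (3) and (4) in which the interaction terms are averages over the other M-1 players and the other N-1 machines; the names `tow_bombe_step` and `select` are illustrative and do not come from the paper.

```python
def tow_bombe_step(q_prev, choices, rewarded, omega):
    """One update of Eqs. (3) and (4) for M players and N machines.

    q_prev   : M x N list holding Q_(i,k)(t-1)
    choices  : choices[i] is the machine index played by player i
    rewarded : rewarded[i] is True if player i obtained a reward
    Returns (q, x): the new estimates and the interface heights.
    """
    m, n = len(q_prev), len(q_prev[0])
    # Delta Q is +1 or -omega for the played machine and 0 elsewhere.
    dq = [[0.0] * n for _ in range(m)]
    for i in range(m):
        dq[i][choices[i]] = 1.0 if rewarded[i] else -omega
    # Eq. (3): own increment minus the mean increment of the other players.
    q = [[0.0] * n for _ in range(m)]
    for i in range(m):
        for k in range(n):
            others = sum(dq[j][k] for j in range(m) if j != i)
            q[i][k] = dq[i][k] + q_prev[i][k] - others / (m - 1)
    # Eq. (4): own estimate minus the mean estimate of the other machines.
    x = [[0.0] * n for _ in range(m)]
    for i in range(m):
        row_sum = sum(q[i])
        for k in range(n):
            x[i][k] = q[i][k] - (row_sum - q[i][k]) / (n - 1)
    return q, x


def select(x_row):
    """Choose the machine whose interface level is highest."""
    return max(range(len(x_row)), key=lambda k: x_row[k])
```

With this form, the heights of each player sum to zero over the machines, so raising one interface necessarily lowers the others, as expected for an incompressible fluid.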
In addition to the above-mentioned dynamics, some fluctuations or external
oscillations are added to X_{(i,k)}. The TOW bombe’s performance is sensitive to
these added fluctuations or oscillations, because the fluctuations represent exploration
patterns in the early stage.

Thus, the TOW bombe operates only by adding an operation which raises or
lowers the interface level (+1 or -ω) according to the result (success or failure of
coin gain) for each player (M times in total) at each time. After these operations,
the interface levels move according to the volume conservation law, calculating
the next selection for each player. In each player’s selection, an efficient search is
achieved as a result of the TOW principle, which can obtain a solution accurately
and quickly for trial-and-error tasks. Moreover, through the interaction among
players via the yellow fluid, the Nash equilibrium can be avoided, thereby achieving
the social maximum [2].

To show that the TOW bombe avoids the Nash equilibrium and regularly achieves
an overall optimisation, we consider a case wherein (P_A, P_B, P_C, P_D, P_E) = (0.03,
0.05, 0.1, 0.2, 0.9) as a typical example. For simplicity, only part of the payoff tensor,
which has 125 (= 5^3) elements, is described as follows; only matrix elements for which
each player does not choose low-ranking A and B are shown (Tables 1, 2 and 3). For
each matrix element, the reward probabilities are given in the order of players 1, 2
and 3.

Table 1: Payoff matrix of the case where (P_C, P_D, P_E) = (0.1, 0.2, 0.9), player 3
chooses C

               player 2: C         player 2: D         player 2: E
player 1: C    1/30, 1/30, 1/30    0.05, 0.2, 0.05     0.05, 0.9, 0.05
player 1: D    0.2, 0.05, 0.05     0.1, 0.1, 0.1       0.2, 0.9, 0.1 SM
player 1: E    0.9, 0.05, 0.05     0.9, 0.2, 0.1 SM    0.45, 0.45, 0.1


Table 2: Payoff matrix of the case where (P_C, P_D, P_E) = (0.1, 0.2, 0.9), player 3
chooses D

               player 2: C         player 2: D         player 2: E
player 1: C    0.05, 0.05, 0.2     0.1, 0.1, 0.1       0.1, 0.9, 0.2 SM
player 1: D    0.1, 0.1, 0.1       2/30, 2/30, 2/30    0.1, 0.9, 0.1
player 1: E    0.9, 0.1, 0.2 SM    0.9, 0.1, 0.1       0.45, 0.45, 0.2
Table 3: Payoff matrix of the case where (P_C, P_D, P_E) = (0.1, 0.2, 0.9), player 3
chooses E

               player 2: C         player 2: D         player 2: E
player 1: C    0.05, 0.05, 0.9     0.1, 0.2, 0.9 SM    0.1, 0.45, 0.45
player 1: D    0.2, 0.1, 0.9 SM    0.1, 0.1, 0.9       0.2, 0.45, 0.45
player 1: E    0.45, 0.1, 0.45     0.45, 0.2, 0.45     0.3, 0.3, 0.3 NE


Social maximum (SM) is a state in which the maximum amount of total reward
is obtained by all the players. In this problem, the social maximum corresponds
to a segregation state in which the players choose the top three distinct machines
(C, D, E), respectively; there are six segregation states indicated by SM in the
Tables. In contrast, the Nash equilibrium (NE) is a state in which all the players
choose machine E independent of others’ decisions; machine E gives the reward
with the highest probability, when each player behaves selfishly.
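The SM and NE labels can be checked by brute force over all 125 assignments. The following sketch assumes only what the text states (rewards are split evenly among colliding players); the function names are illustrative, and machines A to E are indexed 0 to 4.

```python
from itertools import product

PROBS = [0.03, 0.05, 0.1, 0.2, 0.9]   # P_A ... P_E of the typical example
PLAYERS = 3


def payoffs(assignment, probs=PROBS):
    """Expected reward of each player; colliding players split evenly."""
    return [probs[k] / assignment.count(k) for k in assignment]


def social_maxima(probs=PROBS, players=PLAYERS):
    """All assignments with the largest total expected reward."""
    states = list(product(range(len(probs)), repeat=players))
    best = max(sum(payoffs(s, probs)) for s in states)
    return [s for s in states if abs(sum(payoffs(s, probs)) - best) < 1e-12]


def nash_equilibria(probs=PROBS, players=PLAYERS):
    """Assignments in which no single player gains by switching machines."""
    result = []
    for s in product(range(len(probs)), repeat=players):
        current = payoffs(s, probs)
        stable = True
        for i in range(players):
            for k in range(len(probs)):
                trial = s[:i] + (k,) + s[i + 1:]
                if payoffs(trial, probs)[i] > current[i] + 1e-12:
                    stable = False
        if stable:
            result.append(s)
    return result
```

For the probabilities above, `social_maxima()` returns the six permutations of (C, D, E) and `nash_equilibria()` returns only the state in which every player chooses E, in agreement with Tables 1 to 3.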

The performance of the TOW bombe was evaluated using a score: the number
of rewards (coins) a player obtained in his/her 1,000 plays. In cognitive radio
communication, the score corresponds to the number of packets that have been
successfully transmitted [13, 14]. Figure 3(a) shows the TOW bombe scores in the typical
example wherein (P_A, P_B, P_C, P_D, P_E) = (0.03, 0.05, 0.1, 0.2, 0.9). Since 1,000
samples were used, there are 1,000 circles. Each circle indicates the score obtained
by player i (horizontal axis) and player j (vertical axis) for one sample. There are
six clusters in Figure 3(a) corresponding to the two-dimensional projections of the
six segregation states, implying the overall optimisation. The social maximum
points are given as follows: (the score of player 1, the score of player 2, the score
of player 3) = (100, 200, 900), (100, 900, 200), (200, 100, 900), (200, 900, 100),
(900, 100, 200) and (900, 200, 100). The TOW bombe did not reach the Nash
equilibrium state (300, 300, 300).

In our simulations, we used an ‘adaptive’ weighting parameter ω, meaning that
the parameter is estimated by using the device’s own variables (see Method). Owing to this
estimation cost, clusters of circles are not located exactly at the social maximum
points. If we set the weighting parameter ω at 0.08, which is calculated from γ' = P_B + P_C
(see Method), those clusters are located exactly on the social maximum points (see
Figures in [11]).

Figure 3(b) shows TOW bombe performance, i.e. sample averages of the total
scores of all players up to 1,000 plays, for three different types of fluctuation.
The black, red and blue lines denote the cases of internal random
fluctuations, internal fixed fluctuations and external oscillations, respectively (see
Method). The horizontal axis denotes the sample averages of maximum
fluctuation. In the maximal case, the average total score reached nearly 1,200
(=100+200+900), which is the value of the social maximum, although there are
some gaps resulting from estimation costs.

[Figure 3, panels (a) to (c); recoverable axis labels: ‘Score of player 1’ and ‘Total score’.]

Figure 3: (a) TOW bombe scores in the case wherein (P_A, P_B, P_C, P_D, P_E) = (0.03,
0.05, 0.1, 0.2, 0.9). (b) Sample averages of total TOW bombe scores in the case
wherein (P_A, P_B, P_C, P_D, P_E) = (0.03, 0.05, 0.1, 0.2, 0.9). (c) Sample averages of
mean distance between players’ scores in the case wherein (P_A, P_B, P_C, P_D, P_E) =
(0.03, 0.05, 0.1, 0.2, 0.9).

Figure 3(c) shows TOW bombe fairness, i.e. sample averages of the mean
distance between players’ scores, for the three different types of fluctuation.
We can confirm lower fairness in the case of internal fixed fluctuations
(red line). Artificially created fluctuations, such as internal fixed fluctuations,
often show lower fairness because of the existence of biases (lack of uniformity or
randomness) in the fluctuations. Although the external oscillations (sine waves) have
higher fairness (blue line), controlling the blue and yellow adjusters appropriately
is difficult. Moreover, the performances of these two types of fluctuation rapidly
decrease as the magnitude of fluctuations increases, as shown in Fig. 3(b).

We can conclude that only the internal random fluctuations, which are supposed
to be generated automatically in the real TOW bombe, exhibit higher performance
and fairness. This conclusion is consistent even in cases where we set the weighting
parameter ω at 0.08. This indicates the construction of a novel analog computing
scheme which exploits nature’s power in terms of automatic generation of random
fluctuations, simultaneous computations using a conservation law and intrinsic
efficiency.

2 Discussion 

How can we harness nature’s power for computations such as automatic generation 
of random fluctuations, simultaneous computations using a conservation law and 
intrinsic efficiency as well as the feasibility of massive computations? 

Alan Turing mathematically clarified the concept of ‘computation’ by proposing
his Turing machine, the simplest model of computation [18, 19]. A Turing
machine consists of a sequence of steps which can read and write a single symbol
on a tape. These ‘discrete’ and ‘sequential’ steps are ‘simple’ for a human to
understand. Moreover, he found a ‘universal Turing machine’ that can simulate all other
computations. Owing to this machine, algorithms can be studied on their own,
without regard to the systems that are implementing them [20]. Human beings no
longer need to be concerned about underlying mechanisms. In other words,
software can be abstracted away from hardware. This property has brought substantial
development in digital computers. Simultaneously, however, these algorithms have
lost their links to the natural phenomena implementing them. He exchanged natural
affinity for artificial convenience.

Digital computers created a ‘monster’ called ‘exponential explosion’, wherein
computational cost grows exponentially as a function of problem size (NP
problems). In our daily lives, we often encounter this type of problem, such as
scheduling, satisfiability (SAT) and resource allocation problems. For a digital computer,
such problems become intractable as the problem size grows. In contrast, nature
always ‘computes’ infinitely many computations at every moment. However,
we do not know how to extract and harness this power of nature.

Herein, we demonstrate that an analog decision-making device, called the TOW
bombe, can be implemented physically by using two kinds of incompressible fluid
in coupled cylinders and can efficiently achieve overall optimisation in the machine
assignment problem in CBP by exploiting nature’s power, including automatic
generation of random fluctuations, simultaneous computations using a conservation
law and intrinsic efficiency. The randomness of fluctuations generated
automatically in the real TOW bombe might not be high, but there are ways to enhance
randomness. For example, turbulence occurs if we move an adjuster rapidly in an
up-and-down operation.
The TOW bombe enables us to solve the assignment problem for M players
and N machines by repeating M up-and-down operations of fluid interface levels
in cylinders at each iteration; it does not require the calculation of as many
evaluation values as are required when using a conventional digital computer, because it
entrusts the huge amount of computation to the physical processes of fluids. This
suggests that there are advantages to analog computation even in today’s digital
age.

Although the payoff tensor has N^M elements, the TOW bombe need not hold
N^M evaluation values. If we ignore the diagonal elements, N evaluation values are
sufficient for each player’s estimation of which machine is the best. Therefore,
using the TOW bombe, the CBP is reducible to an O(NM) problem when implementing
a collision-avoiding mechanism handled by the yellow fluid, although, in a strict
sense, the computational cost must include the cost of providing the random
fluctuations generated by the fluids’ physical dynamics. In Fig. 3(b), we showed the
results of only three types of fluctuation. TOW bombe performance with internal
M-random fluctuations (see Method) was the same as that with the internal random
fluctuations, although computations for generating the former type of fluctuation
require a cost that grows exponentially as O(N^M). This is because the exponential
type of fluctuation is not effective for O(NM) problems. Various random seed
patterns do not enhance the performance for O(NM) problems because of the
reducibility of the CBP to three independent BPs.
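The gap between the two counts can be made concrete with a small helper. It is a sketch for illustration only; the function name is hypothetical.

```python
def evaluation_counts(n_machines, m_players):
    """Return (payoff-tensor elements N^M, per-player estimates N*M)."""
    return n_machines ** m_players, n_machines * m_players
```

For the example in this paper (N = 5, M = 3) the payoff tensor has 125 elements, whereas the three players together hold only 15 estimates.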

However, this is not the case if we focus on more complex problems, such
as the ‘Extended Prisoner’s Dilemma Game’ (see Supplementary Information); we
must prepare more than NM evaluation values, because a player’s reward
changes drastically according to the selections of other players in this problem. There
are some cases that can be approximately solved by the TOW bombe even in this
type of complex problem. In these cases, the exponential type of fluctuation can
enhance the performance slightly. This fact may suggest that we have found a first toehold
for harnessing another aspect of nature’s power, namely the feasibility of massive computations.

Unfortunately, it is difficult to solve this type of complex problem using the
TOW bombe in general. To solve more complex problems, we must also extend the
TOW bombe. We have some ideas regarding TOW bombe extension using some
fluid compressibility, local inflow and outflow, a reservoir for blue or yellow fluid,
a time order of fluctuations and quantum effects such as non-locality and
entanglement. The TOW bombe can also be implemented on the basis of quantum physics.
In fact, the authors have exploited optical energy transfer dynamics between
quantum dots and single photons to design decision-making devices [22, 23, 24]. Our
method might be applicable to a class of problems derived from CBP and broader
varieties of game payoff tensors, implying that wider applications can be expected.
We will report these observations and results elsewhere in the future.
Methods


The weighting parameter ω

TOW dynamics involves the parameter ω, to which its performance is sensitive.
From analytical calculations, it is known that the following ω_0 is sub-optimal in
the BP (see Supplementary Information):

    \omega_0 = \frac{\gamma}{2 - \gamma},    (5)

    \gamma = P_A + P_B.    (6)

Here, it is assumed that P_A is the largest reward probability and P_B is the second
largest.

In the CBP cases (M players and N machines), the following ω_0 is sub-optimal:

    \omega_0 = \frac{\gamma'}{2 - \gamma'},    (7)

    \gamma' = P_{(M)} + P_{(M+1)}.    (8)

Here, P_{(M)} is the M-th largest reward probability.

Players must estimate ω_0 using their own variables, because information regarding
reward probabilities is not given to players. We call this an ‘adaptive’ weighting
parameter. There are many estimation methods, such as Bayesian inference, but we
simply use ‘direct substitution’ herein. Direct substitution uses R_j(t)/N_j(t) for P_j,
where R_j(t) is the number of reward gains from machine j through time t and N_j(t)
is the number of plays of machine j through time t.
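The weighting parameter and its adaptive estimate can be computed as in the sketch below. The function names are illustrative, and treating a machine that has never been played as having an estimated probability of 0 is an assumption made here; the paper does not specify that case.

```python
def omega_from_probs(probs, m_players=1):
    """omega_0 = gamma' / (2 - gamma') with gamma' = P_(M) + P_(M+1).

    With m_players = 1 this reduces to the BP case, where gamma is the sum
    of the largest and second-largest reward probabilities.
    """
    ranked = sorted(probs, reverse=True)
    gamma = ranked[m_players - 1] + ranked[m_players]
    return gamma / (2.0 - gamma)


def adaptive_omega(reward_counts, play_counts, m_players=1):
    """Direct substitution: estimate P_j by R_j(t) / N_j(t)."""
    # Assumption: an unplayed machine (N_j = 0) is estimated as 0.
    estimates = [r / n if n > 0 else 0.0
                 for r, n in zip(reward_counts, play_counts)]
    return omega_from_probs(estimates, m_players)
```

For the typical example with three players, γ' = 0.1 + 0.05 = 0.15, so ω_0 = 0.15/1.85, which rounds to the value 0.08 quoted in the Results.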

TOW dynamics for general BP 

In this paper, we use TOW dynamics only for the Bernoulli type of BP in which
the reward r is 1 or 0. Another type of TOW dynamics can also be constructed for
the general BP in which the reward r is a real value from an interval [0, R]. Here, R is
an arbitrary positive value, and the reward r is drawn according to a given probability
distribution whose mean and variance are μ and σ^2, respectively.

In this case, the following estimate Q_k (k ∈ {A, B}) is used instead of Eq. (1):

    Q_k(t) = \sum_{j=1}^{N_k(t)} r_k(j) - \gamma^* N_k(t).    (9)

Here, N_k is the number of plays of machine k until time t and r_k(j) is the reward in
k at time j, where γ* is the following parameter:

    \gamma^* = \frac{\mu_A + \mu_B}{2}.    (10)

If machine k is played at time t, the reward r_k(t) and -γ* are added to X_k(t-1).
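The general estimate is consistent with the Bernoulli estimate of Eq. (1). The short derivation below is a worked check, assuming that γ* is read as the mean of the two expected rewards and writing R_k(t) = N_k(t) - L_k(t) for the number of rewards obtained from machine k.

```latex
\begin{align*}
Q_k(t) &= N_k(t) - (1+\omega)\,L_k(t)
        = N_k(t) - (1+\omega)\bigl(N_k(t) - R_k(t)\bigr) \\
       &= (1+\omega)\,R_k(t) - \omega\,N_k(t)
        = (1+\omega)\left[ R_k(t) - \frac{\omega}{1+\omega}\,N_k(t) \right], \\
\frac{\omega_0}{1+\omega_0}
       &= \frac{\gamma/(2-\gamma)}{2/(2-\gamma)}
        = \frac{\gamma}{2}
        = \frac{P_A + P_B}{2}.
\end{align*}
```

Up to the positive factor (1 + ω_0), which does not change the sign of X_A, the Bernoulli estimate with ω = ω_0 is therefore the accumulated reward minus (P_A + P_B)/2 per play, which has the same form as Eq. (9).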
Generating methods of fluctuation

1. Internal fixed fluctuations 

First, we define fixed moves O_{k'} (k' = 0, ..., 4) as follows:

    \{O_0, O_1, O_2, O_3, O_4\} = \{0, A, 0, -A, 0\}.    (11)

Here, A is an amplitude parameter. Note that \sum_{k'=0}^{4} O_{k'} = 0.

To use the above moves O_{k'} recursively, we introduce a new variable num (num =
0, ..., 4) as follows:

    num = \{t + (k - 2)\} mod 5.    (12)

Here, t is the time. For each machine k (k = 1, ..., 5), we use the following set of
fluctuations.


If num = 0:

    osc_{(1,k)}(t) = O_0,    (13)
    osc_{(2,k)}(t) = O_3,    (14)
    osc_{(3,k)}(t) = O_1.    (15)

If num = 1:

    osc_{(1,k)}(t) = O_1,    (16)
    osc_{(2,k)}(t) = O_4,    (17)
    osc_{(3,k)}(t) = O_3.    (18)

If num = 2:

    osc_{(1,k)}(t) = O_2,    (19)
    osc_{(2,k)}(t) = O_0,    (20)
    osc_{(3,k)}(t) = O_4.    (21)

If num = 3:

    osc_{(1,k)}(t) = O_3,    (22)
    osc_{(2,k)}(t) = O_1,    (23)
    osc_{(3,k)}(t) = O_2.    (24)

If num = 4:

    osc_{(1,k)}(t) = O_4,    (25)
    osc_{(2,k)}(t) = O_2,    (26)
    osc_{(3,k)}(t) = O_0.    (27)

It always holds that Σ_{i=1}^{3} osc_{(i,k)}(t) = 0 and Σ_{k=1}^{5} osc_{(i,k)}(t) = 0. These
conditions mean that the fluctuations added to X_{(i,k)} cancel in total. In other
words, the total volume of blue or yellow fluid does not change. As a result, we
create artificial ‘internal’ fluctuations.
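The fixed pattern can be tabulated and its two zero-sum conditions checked directly. The sketch below follows Eqs. (11) to (27) as given here, including the offset (k - 2) in Eq. (12); the function name and the 0-based list layout are choices made for the example.

```python
def fixed_fluctuations(t, amplitude):
    """Return the 3 x 5 table osc[i][k] of internal fixed fluctuations at time t."""
    moves = [0.0, amplitude, 0.0, -amplitude, 0.0]   # O_0 ... O_4, Eq. (11)
    # Row `num` lists the move index used by players 1, 2 and 3, Eqs. (13)-(27).
    pattern = [(0, 3, 1), (1, 4, 3), (2, 0, 4), (3, 1, 2), (4, 2, 0)]
    osc = [[0.0] * 5 for _ in range(3)]
    for k in range(1, 6):                            # machines are 1-based in the text
        num = (t + (k - 2)) % 5                      # Eq. (12)
        for i in range(3):
            osc[i][k - 1] = moves[pattern[num][i]]
    return osc
```

Summing any row (one player over all machines) or any column (one machine over all players) of the returned table gives zero, which is the volume conservation condition stated above.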

2. Internal random fluctuations 

First, a matrix sheet of random fluctuations (Sheet_{(i,k)}) is prepared. Here, i =
1, ..., 3 and k = 1, ..., 5.

1. r is a random value from [0,1]. We call this the ‘seed’.

2. There are NM (= 15) possibilities for a seed position. Choose the seed
position (i_0, k_0) randomly from i_0 = 1, ..., 3 and k_0 = 1, ..., 5 and place the
seed r at that point,

    Sheet_{(i_0,k_0)} = r.    (28)

3. All elements of the k_0-th column other than (i_0, k_0) are substituted with -0.5 * r.

4. All elements of the i_0-th row other than (i_0, k_0) are substituted with -0.25 * r.

5. All remaining elements are substituted with r/8.0.

6. The matrix sheet is accumulated in a summation matrix Sum_{(i,k)}.

7. Repeat steps 2 to 6 D times. Here, D is a parameter.

We used the following set of fluctuations,

    osc_{(i,k)}(t) = A/D * Sum_{(i,k)}.    (29)

Here, A is an amplitude parameter.

It always holds that Σ_{i=1}^{3} osc_{(i,k)}(t) = 0 and Σ_{k=1}^{5} osc_{(i,k)}(t) = 0, as with
the internal fixed fluctuations. The total volume of blue or yellow fluid does not
change. As a result, we create ‘internal’ random fluctuations naturally. At every
time step, this procedure costs O(N · M) computations with a digital computer.
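The seven steps above translate into the following sketch for the 3-player, 5-machine case. It assumes that the seed r is drawn once and reused over the D repetitions, since only steps 2 to 6 are repeated; the function name and the optional `rng` argument are illustrative. Note that `random.Random.random()` draws from [0, 1) rather than the closed interval.

```python
import random


def random_fluctuations(amplitude, d_repeats, rng=None):
    """Return the 3 x 5 table osc[i][k] of internal random fluctuations."""
    rng = rng or random.Random()
    r = rng.random()                                # step 1: the seed
    total = [[0.0] * 5 for _ in range(3)]           # summation matrix Sum
    for _ in range(d_repeats):                      # step 7
        i0 = rng.randrange(3)                       # step 2: seed position
        k0 = rng.randrange(5)
        for i in range(3):
            for k in range(5):
                if (i, k) == (i0, k0):
                    value = r                       # Eq. (28)
                elif k == k0:
                    value = -0.5 * r                # step 3: seed column
                elif i == i0:
                    value = -0.25 * r               # step 4: seed row
                else:
                    value = r / 8.0                 # step 5: remaining elements
                total[i][k] += value                # step 6
    return [[amplitude / d_repeats * v for v in row] for row in total]  # Eq. (29)
```

The coefficients -0.5, -0.25 and 1/8 are exactly those that make every row and every column of a 3 x 5 sheet sum to zero, so they would need to change for other numbers of players or machines.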
3. Internal M-random fluctuations (exponential)

First, a matrix sheet of random fluctuations (Sheet_{(i,k)}) is prepared. Here, i =
1, ..., 3 and k = 1, ..., 5.

1. For each player i, an independent random value r_i is generated from [0,1]. We
call these ‘seeds’.

2. There are N^M (= 125) possibilities for a seed position pattern. For each
player i, choose the seed position (i, k_0(i)) randomly from k_0(i) = 1, ..., 5
and place the seed r_i at that point,

    Sheet_{(i,k_0(i))} = r_i.    (30)

However, we choose the k_0(i) to be distinct. Therefore, there are really
N(N - 1)(N - 2) (= 60) possibilities.

3. For each i, all elements of the k_0(i)-th column other than (i, k_0(i)) are
substituted with -0.5 * r_i.

4. All remaining elements of the 1st row are substituted with
-0.50 * (r_1 - 0.50 * r_2 - 0.50 * r_3).

5. All remaining elements of the 2nd row are substituted with
-0.50 * (r_2 - 0.50 * r_1 - 0.50 * r_3).

6. All remaining elements of the 3rd row are substituted with
-0.50 * (r_3 - 0.50 * r_1 - 0.50 * r_2).

7. The matrix sheet is accumulated in a summation matrix Sum_{(i,k)}.

8. Repeat steps 2 to 7 D times. Here, D is a parameter.

We used the following set of fluctuations,

    osc_{(i,k)}(t) = A/D * Sum_{(i,k)}.    (31)

Here, A is an amplitude parameter.

It always holds that Σ_{i=1}^{3} osc_{(i,k)}(t) = 0 and Σ_{k=1}^{5} osc_{(i,k)}(t) = 0, as with the
internal fixed or random fluctuations. The total volume of blue or yellow fluid does
not change. As a result, we create ‘internal’ M-random fluctuations naturally. At
every time step, this procedure costs exponential computations of O(N^M) with a
digital computer.
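The M-random procedure can be sketched in the same way. As before, this is an illustration for three players and five machines that assumes the seeds are drawn once and reused across the D repetitions; the function name and the `rng` argument are not from the paper.

```python
import random


def m_random_fluctuations(amplitude, d_repeats, rng=None):
    """Return the 3 x 5 table osc[i][k] of internal M-random fluctuations."""
    rng = rng or random.Random()
    seeds = [rng.random() for _ in range(3)]         # step 1: r_1, r_2, r_3
    total = [[0.0] * 5 for _ in range(3)]            # summation matrix Sum
    for _ in range(d_repeats):                       # step 8
        k0 = rng.sample(range(5), 3)                 # step 2: distinct seed columns
        sheet = [[None] * 5 for _ in range(3)]
        for i in range(3):
            for j in range(3):                       # Eq. (30) and step 3
                sheet[j][k0[i]] = seeds[i] if j == i else -0.5 * seeds[i]
        for i in range(3):                           # steps 4 to 6
            others = sum(seeds) - seeds[i]
            fill = -0.5 * (seeds[i] - 0.5 * others)
            for k in range(5):
                if sheet[i][k] is None:
                    sheet[i][k] = fill
        for i in range(3):                           # step 7
            for k in range(5):
                total[i][k] += sheet[i][k]
    return [[amplitude / d_repeats * v for v in row] for row in total]  # Eq. (31)
```

Each row contains one seed, two entries inherited from the other players' seed columns and two fill values, and these five entries cancel; the two columns without a seed cancel because the three fill values sum to zero.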
4. External oscillations

Herein, we used completely synchronised oscillations osc_{(i,k)}(t) added to every
player’s X_{(i,k)},

    osc_{(i,k)}(t) = A \sin(2\pi t/5 + 2\pi (k - 1)/5).    (32)

Here, i = 1, ..., 3 and k = 1, ..., 5. A is an amplitude parameter. These oscillations
are externally provided by appropriately controlling the blue and yellow adjusters.
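Eq. (32) is simple enough to evaluate directly; the following sketch uses a hypothetical function name and takes the machine index k as 1-based, as in the text.

```python
import math


def external_oscillation(t, k, amplitude):
    """Eq. (32): identical for every player i; machines are k = 1, ..., 5."""
    return amplitude * math.sin(2 * math.pi * t / 5 + 2 * math.pi * (k - 1) / 5)
```

The five phases are equally spaced, so the oscillations sum to zero over the machines. Because every player receives the same value, however, the sum over the players is three times that value and does not vanish in general, which is what distinguishes these external oscillations from the internal fluctuations.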
Supplementary Information


Efficient Decision-Making by Physical Objects 

Elements of computing devices are subject to physical laws that work as
‘constraints’. These constraints always have negative effects on computing ability. For
example, in a complementary metal oxide semiconductor (CMOS) structure,
considerably complicated circuits are required even for simple logical operations such
as NAND and NOR because physical constraints can violate logically correct
behaviour in simple structures.

However, herein, we show the opposite fact that such constraints can also have
positive effects. That is, computational efficiency can be generated from the
movements of physical objects which are subjected to the volume conservation law. This
‘tug-of-war (TOW) principle’ is addressed as efficiency in ‘trial-and-error’.

Consider two slot machines. Both machines have individual reward
probabilities P_A and P_B. At each trial, a player selects one of the machines and obtains
some reward, a coin for example, with the corresponding probability. The player
wants to maximize the total reward sum obtained after a particular number of
selections. However, it is supposed that the player does not know these probabilities.
The multi-armed bandit problem (BP) involves determining the optimal strategy
for selecting the machine which yields maximum rewards by referring to past
experiences.

The BP was originally described by Robbins [1], although the essential problem
was studied earlier by Thompson [2]. The optimal strategy, the Gittins index,
is known only for a limited class of problems wherein the reward distributions
are assumed to be ‘known’ to the players [3, 4]. Even in these problems,
computing the Gittins index is not tractable in practice for many cases.
Agrawal and Auer et al. proposed decision-making algorithms that could express
the index as a simple function of the total reward obtained from a machine
[5, 6]. In particular, the ‘upper confidence bound 1 (UCB1) algorithm’ proposed
by Auer et al. is used worldwide for many applications.
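As a point of reference for the index-based algorithms mentioned above, the following is a minimal Python sketch of UCB1 on two Bernoulli machines. It uses the standard UCB1 index (empirical mean plus sqrt(2 ln n / n_j)). The function name `ucb1`, the seed and the probabilities (0.7, 0.3) are illustrative assumptions and are not taken from this work.

```python
import math
import random


def ucb1(probs, n_plays, seed=0):
    """Play Bernoulli machines with UCB1 and return per-machine play counts.

    probs   : true reward probabilities (hidden from the player)
    n_plays : total number of selections
    """
    rng = random.Random(seed)
    k = len(probs)
    counts = [0] * k   # number of times machine j was played
    wins = [0] * k     # number of rewarded plays of machine j

    for t in range(1, n_plays + 1):
        if t <= k:
            arm = t - 1  # initialisation: play every machine once
        else:
            # index = empirical mean + confidence width, with t - 1 plays so far
            arm = max(
                range(k),
                key=lambda j: wins[j] / counts[j]
                + math.sqrt(2.0 * math.log(t - 1) / counts[j]),
            )
        counts[arm] += 1
        if rng.random() < probs[arm]:
            wins[arm] += 1
    return counts


counts = ucb1([0.7, 0.3], n_plays=2000)
print(counts)
```

The confidence width shrinks only logarithmically slowly for a machine that is rarely played, so the lower-reward machine keeps being sampled from time to time.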

Kim et al. proposed a ‘decision-making dynamics’ called TOW; it was inspired
by the true slime mold Physarum [7, 8, 9, 10, 11, 12], which maintains a
constant intracellular resource volume while collecting environmental
information by concurrently expanding and shrinking its branches. The
conservation law entails a ‘non-local correlation’ among the branches, that is,
the volume increment in one branch is immediately compensated for by volume
decrement(s) in the other branch(es). This non-local correlation was shown to
be useful for decision making.

In this paper, we propose ‘the TOW principle’, which explains why
computational efficiency can be generated from the movements of physical
objects that are subject to the volume conservation law.

Solvability: random walk approach 

Let us consider a one-dimensional random walk, where the distance of a right
flight when ‘a coin’ is dispensed is $\alpha$ and the distance of a left flight
when ‘no coin’ is dispensed is $\beta$. We assume that $P_A$ (probability of
right flight in random walk A) > $P_B$ (probability of right flight in random
walk B) for simplicity. After time step $t$, the displacement $R_k(t)$
($k \in \{A, B\}$) can be described by

$$R_k(t) = \alpha (N_k - L_k) - \beta L_k = \alpha N_k - (\alpha + \beta) L_k. \tag{33}$$

Here, $N_k$ is the number of times machine $k$ has been played until time $t$
and $L_k$ is the number of non-rewarded (i.e. left flight) events in $k$ until
time $t$. The expected value of $R_k$ can be obtained from the following
equation,

$$E(R_k(t)) = \{\alpha P_k - \beta (1 - P_k)\} N_k. \tag{34}$$

In the overlapping area between the two distributions of $R_A$ and $R_B$, we
cannot estimate correctly which is the greater. The overlapping area must
decrease as $N_k$ increases to avoid incorrect judgements. This requirement can
be expressed in the following forms:

$$\alpha P_A - \beta (1 - P_A) > 0, \tag{35}$$

$$\alpha P_B - \beta (1 - P_B) < 0. \tag{36}$$

These forms can be transformed to the form

$$P_B < \frac{\beta}{\alpha + \beta} < P_A. \tag{37}$$

In other words, the parameters $\alpha$ and $\beta$ must satisfy the above
conditions for the random walk to judge correctly which is the larger. We can
easily confirm that the following form satisfies these conditions,

$$\frac{\beta}{\alpha + \beta} = \frac{P_A + P_B}{2}. \tag{38}$$

On the other hand, we use the following learning rule in our TOW dynamics,

$$Q_k(t) = N_k(t) - (1 + \omega) L_k(t). \tag{39}$$


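Here, $\omega$ is a weighting parameter. From $R_k(t)/\alpha = Q_k(t)$, we
obtain

$$\omega = \frac{\beta}{\alpha}. \tag{40}$$

From Eqs. (38) and (40), we can obtain

$$\omega = \frac{\gamma}{2 - \gamma}, \tag{41}$$

$$\gamma = P_A + P_B. \tag{42}$$

Indeed, Eq. (38) gives $2\beta = \gamma (\alpha + \beta)$, that is,
$\beta (2 - \gamma) = \gamma \alpha$. Therefore, we can conclude that the
algorithm using the learning rule $Q_k$ with the parameter $\omega$ can
accurately solve the BP. Here, we use $\omega_0$ for the above $\omega$.
Detailed analytical calculations are presented in [16].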
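The effect of this choice of the weight can be checked numerically. The sketch below computes the weight from Eqs. (41) and (42) and the expected change of $Q_k$ per play that follows from Eq. (39) with $E(L_k) = (1 - P_k) N_k$. The function names and the probabilities (0.6, 0.2) are illustrative assumptions, and $\gamma < 2$ is assumed so that the weight is finite.

```python
def omega0(p_a, p_b):
    """Weight parameter from Eqs. (41)-(42): omega = gamma / (2 - gamma)."""
    gamma = p_a + p_b
    return gamma / (2.0 - gamma)


def drift_per_play(p_k, omega):
    """Expected change of Q_k per play of machine k.

    From Eq. (39), Q_k = N_k - (1 + omega) L_k, and E(L_k) = (1 - P_k) N_k,
    so E(Q_k) / N_k = 1 - (1 + omega) * (1 - P_k).
    """
    return 1.0 - (1.0 + omega) * (1.0 - p_k)


p_a, p_b = 0.6, 0.2            # illustrative values only
w = omega0(p_a, p_b)           # gamma = 0.8
d_a = drift_per_play(p_a, w)   # drift of Q_A
d_b = drift_per_play(p_b, w)   # drift of Q_B
print(w, d_a, d_b)
```

With this weight the two drifts have opposite signs and equal magnitude, which is the separation condition of Eq. (37) with the threshold placed midway between $P_B$ and $P_A$.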


The TOW principle 

In many popular algorithms, such as the ε-greedy algorithm, an estimate for
reward probability is updated only in a selected arm. In contrast, we consider
the case wherein the sum of the reward probabilities $\gamma = P_A + P_B$ is
given. Then, we can update both estimates simultaneously as follows,

$$A: \frac{N_A - L_A}{N_A}, \qquad B: \gamma - \frac{N_A - L_A}{N_A},$$

$$A: \gamma - \frac{N_B - L_B}{N_B}, \qquad B: \frac{N_B - L_B}{N_B}.$$

Here, the top and bottom rows give estimates based on the information that
machines A and B were selected $N_A$ and $N_B$ times, respectively. Note that
we can also update the estimate of the machine that was not played, owing to
information $\gamma$.
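A small numerical illustration of the two rows of estimates is given below as a Python sketch. The function names and the numbers (10 plays with 3 non-rewarded plays, sum of probabilities 0.9) are hypothetical and serve only to show how the estimate of the unplayed machine follows from the known sum.

```python
def estimates_after_playing_a(n_a, l_a, gamma):
    """Top row: both estimates from N_A plays of A with L_A non-rewarded plays."""
    p_a_hat = (n_a - l_a) / n_a
    return p_a_hat, gamma - p_a_hat


def estimates_after_playing_b(n_b, l_b, gamma):
    """Bottom row: both estimates from N_B plays of B with L_B non-rewarded plays."""
    p_b_hat = (n_b - l_b) / n_b
    return gamma - p_b_hat, p_b_hat


# Hypothetical record: A played 10 times with 3 losses, gamma = 0.9.
est_a, est_b = estimates_after_playing_a(10, 3, 0.9)
print(est_a, est_b)
```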

From the above estimates, each expected reward $Q'_k$ ($k \in \{A, B\}$) is
given as follows,

$$Q'_A = N_A \frac{N_A - L_A}{N_A} + N_B \left( \gamma - \frac{N_B - L_B}{N_B} \right) = N_A - L_A + (\gamma - 1) N_B + L_B, \tag{43}$$

$$Q'_B = N_A \left( \gamma - \frac{N_A - L_A}{N_A} \right) + N_B \frac{N_B - L_B}{N_B} = N_B - L_B + (\gamma - 1) N_A + L_A. \tag{44}$$

These expected rewards $Q'_k$ are not the same as the TOW dynamics learning
rules $Q_k$ (Eq. (39)). However, what we use substantially in the TOW is the
difference,

$$Q_A - Q_B = (N_A - N_B) - (1 + \omega)(L_A - L_B). \tag{45}$$


When we transform the expected rewards $Q'_k$ into

$$Q''_A = Q'_A / (2 - \gamma), \tag{46}$$

$$Q''_B = Q'_B / (2 - \gamma), \tag{47}$$

we can obtain the difference,

$$Q''_A - Q''_B = (N_A - N_B) - \frac{2}{2 - \gamma} (L_A - L_B). \tag{48}$$

Comparing the coefficients of Eqs. (45) and (48), those two differences are
always equal when $\omega = \omega_0$ satisfies

$$\omega_0 = \frac{\gamma}{2 - \gamma}. \tag{49}$$

Eventually, we can obtain the nearly optimal weight parameter $\omega_0$ in
terms of $\gamma$.

This derivation means that TOW dynamics has a learning rule equivalent to
that of the system that can simultaneously update both estimates. TOW can
imitate the system that determines its next move at time $t+1$ by referring to
the estimate of each arm even if it was not selected at time $t$, as if the two
arms were simultaneously selected at time $t$. This unique feature in the
learning rule, derived from the fact that the sum of reward probabilities is
given in advance, is one of the sources of TOW’s high performance.
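The equivalence stated above can be verified with exact arithmetic. The following Python sketch compares the difference of Eq. (45) with the rescaled difference built from Eqs. (43), (44), (46) and (47) for randomly drawn play records. The function names, the value 9/10 for the sum of reward probabilities and the ranges of the counts are assumptions made for illustration.

```python
from fractions import Fraction
import random


def q_diff_tow(n_a, l_a, n_b, l_b, omega):
    """Q_A - Q_B from the TOW learning rule, Eq. (45)."""
    return (n_a - n_b) - (1 + omega) * (l_a - l_b)


def q_diff_two_sided(n_a, l_a, n_b, l_b, gamma):
    """Rescaled difference built from the simultaneous estimates."""
    q_a = n_a - l_a + (gamma - 1) * n_b + l_b   # Eq. (43)
    q_b = n_b - l_b + (gamma - 1) * n_a + l_a   # Eq. (44)
    return (q_a - q_b) / (2 - gamma)            # Eqs. (46)-(47)


gamma = Fraction(9, 10)          # illustrative value of P_A + P_B
omega_0 = gamma / (2 - gamma)    # Eq. (49)

rng = random.Random(1)
for _ in range(1000):
    n_a, n_b = rng.randint(1, 500), rng.randint(1, 500)
    l_a, l_b = rng.randint(0, n_a), rng.randint(0, n_b)
    lhs = q_diff_tow(n_a, l_a, n_b, l_b, omega_0)
    rhs = q_diff_two_sided(n_a, l_a, n_b, l_b, gamma)
    assert lhs == rhs            # exact comparison, thanks to Fraction
print("differences agree; omega_0 =", omega_0)
```

Using `Fraction` avoids floating-point round-off, so the check is an identity test rather than an approximate comparison.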

Monte Carlo simulations confirmed that the performance of TOW dynamics with
$\omega_0$ is comparable to its best performance, i.e. that of the TOW with
$\omega_{opt}$. To derive $\omega_{opt}$ accurately, we need to consider the
fluctuations [10].

The essence described here can be extended to general $K$-machine cases. To
separate the distributions of the top $m$th and $(m+1)$th machines as in the
previous subsection, all we need to do is use the following parameter
$\omega_0$,

$$\omega_0 = \frac{\gamma}{2 - \gamma}, \tag{50}$$

$$\gamma = P_{(m)} + P_{(m+1)}. \tag{51}$$

Here, $P_{(m)}$ denotes the top $m$th reward probability.
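The $K$-machine weight can be sketched in a few lines of Python. The function name `omega0_for_top_m` and the four probabilities are hypothetical. The sketch assumes that the reward probabilities are available for analysis and that the two probabilities involved do not both equal 1.

```python
def omega0_for_top_m(probs, m):
    """omega_0 separating the m-th and (m+1)-th best machines, Eqs. (50)-(51)."""
    if not 1 <= m < len(probs):
        raise ValueError("m must satisfy 1 <= m < K")
    ranked = sorted(probs, reverse=True)   # P_(1) >= P_(2) >= ...
    gamma = ranked[m - 1] + ranked[m]      # P_(m) + P_(m+1)
    return gamma / (2.0 - gamma)


probs = [0.2, 0.8, 0.5, 0.4]               # illustrative K = 4 machines
w1 = omega0_for_top_m(probs, 1)            # gamma = 0.8 + 0.5
w2 = omega0_for_top_m(probs, 2)            # gamma = 0.5 + 0.4
print(w1, w2)
```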
Discussion

The performance of algorithms that solve the BP is mostly evaluated using the
‘regret’, which quantifies the accumulated loss of rewards and is defined as
follows,

$$\mathrm{regret} = (P_A - P_B) E(N_B). \tag{52}$$

Here, $E(N_B)$ denotes the expected value of $N_B$.
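The regret of Eq. (52) can be estimated by Monte Carlo simulation for any selection policy. The sketch below is a generic estimator in Python. The policy interface (a function of the random generator, the play counts and the loss counts) and the uniform baseline are assumptions introduced for illustration and are not the TOW dynamics. For the uniform baseline, $E(N_B)$ is half the number of plays, which gives a simple sanity check.

```python
import random


def uniform_policy(rng, counts, losses):
    """Baseline assumed for illustration: pick A (0) or B (1) at random."""
    return rng.randrange(2)


def estimate_regret(policy, p_a, p_b, n_plays, n_runs, seed=0):
    """Monte Carlo estimate of (P_A - P_B) * E(N_B), assuming P_A > P_B."""
    rng = random.Random(seed)
    probs = (p_a, p_b)
    total_n_b = 0
    for _run in range(n_runs):
        counts = [0, 0]   # N_A, N_B
        losses = [0, 0]   # L_A, L_B
        for _step in range(n_plays):
            arm = policy(rng, counts, losses)
            counts[arm] += 1
            if rng.random() >= probs[arm]:
                losses[arm] += 1   # non-rewarded play
        total_n_b += counts[1]
    return (p_a - p_b) * total_n_b / n_runs


est = estimate_regret(uniform_policy, 0.7, 0.3, n_plays=100, n_runs=2000)
print(est)   # the uniform baseline has E(N_B) = n_plays / 2
```

Any other policy with the same signature can be passed in place of `uniform_policy` to compare regrets under identical conditions.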

It is known that optimal algorithms for the BP, defined by Auer et al., have a
regret proportional to $\log(N)$ [6]. The regret has no finite upper bound as
$N$ increases because such an algorithm continues to require playing the
lower-reward machine to ensure that the probability of incorrect judgment goes
to zero. Interestingly, we analytically demonstrated in our previous work that
our TOW dynamics has a constant regret [16]. A constant regret means that the
probability of incorrect judgment remains non-zero in TOW dynamics, though this
probability is nearly equal to zero. However, reward probabilities would appear
to change frequently in actual decision-making situations, so long-term
behaviour is not crucial for many practical purposes. For this reason, TOW
dynamics would be more suited to real-world applications.

Herein, we propose ‘the TOW principle’, which explains why computational
efficiency can be generated from the movements of physical objects subject to
the volume conservation law. In ordinary decision-making algorithms, the
parameter of ‘exploration time’ is optimized for the BP. That parameter
ordinarily represents $|P_A - P_B|$ (or its inverse). In contrast, we proposed
another independent optimization wherein the parameter represents information
about $P_A + P_B$. Owing to this novel approach to the BP, computational
efficiency can be obtained by directly using physical objects. This idea of the
physical implementation of TOW is applicable to various fields, including the
construction of completely new analog computers [17].


Extended Prisoner’s Dilemma Game 

Consider a situation wherein three people are arrested by police and are required 
to choose from the following five options: 

• A: keep silent, 

• B: confess (implicate him- or herself),

• C: implicate the next person (circulative as 1,2,3,1,2,3,...),

• D: implicate the third person (circulative as 1,2,3,1,2,3,...),

• E: implicate both of the others.

According to the three persons’ choices, the ‘degree of charges’ (from 0 to 3)
is determined for every person. For example, writing the choices as (person 1,
person 2, person 3), the degrees of charges are (1,1,1) for the choice (B,B,B),
(1,1,0) for the choice (A,B,C), (2,1,1) for the choice (B,C,E), (2,2,1) for the
choice (C,E,E), (2,2,2) for the choice (E,E,E), (3,1,1) for the choice (B,E,E),
etc. For each pattern of degrees of charges, a set of reward probabilities of
each person is determined as follows:

• the (0,0,0) is (R2,R2,R2),

• the (1,1,1) is (R1,R1,R1),

• the (2,1,1) or (1,2,1) or (1,1,2) is (R,R,R): the social maximum,

• the (2,2,2) is (P,P,P): the Nash equilibrium.

Otherwise, each difference between his or her degree and the minimum degree
of the pattern determines a reward probability. If his/her degree is the same
as the minimum degree, the reward probability is ‘T’, otherwise ‘S’. Moreover,
the difference between his or her degree and the minimum degree of the pattern
is added to it. For example, the (1,1,0) is (S1,S1,T1), the (2,2,1) is
(S1,S1,T1), the (3,1,1) is (S2,T2,T2). Here, we set T3 = 0.79, T2 = 0.76,
T1 = 0.73, R = 0.70, R1 = 0.60, R2 = 0.55, P = 0.50, S1 = 0.40, S2 = 0.30 and
S3 = 0.20. Therefore, the social maximum is (R,R,R). Here, it is assumed that
the police know that there are a main suspect and two accomplices. The complete
list of reward probabilities is shown in Tables 4, 5, 6 and 7.
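The mapping from a selection pattern to the degrees of charges can be written compactly. The Python sketch below is one reading of the five options: the ‘next person’ of person i is taken as i + 1 and the ‘third person’ as i + 2 in the circular order 1,2,3,1,2,3. The function name is hypothetical, and this reading is an assumption, although it reproduces the example patterns and the table rows tested here.

```python
VALID_OPTIONS = set("ABCDE")


def degrees_of_charges(choices):
    """Return the degrees of charges for persons 1..3 given their options.

    choices: three letters from 'A' to 'E', ordered as
             (person 1, person 2, person 3).
    """
    if len(choices) != 3 or any(c not in VALID_OPTIONS for c in choices):
        raise ValueError("choices must be three letters from 'A' to 'E'")
    degrees = [0, 0, 0]
    for i, c in enumerate(choices):
        nxt, third = (i + 1) % 3, (i + 2) % 3   # circular order 1,2,3,1,...
        targets = {
            "A": [],            # keep silent
            "B": [i],           # confess
            "C": [nxt],         # implicate the next person
            "D": [third],       # implicate the third person
            "E": [nxt, third],  # implicate both of the others
        }[c]
        for j in targets:
            degrees[j] += 1
    return tuple(degrees)


for pattern in ("ABC", "BCE", "CEE", "EEE", "BEE"):
    print(pattern, degrees_of_charges(pattern))
```

The sketch covers only the degrees of charges; the reward probabilities themselves are taken from the tabulated values rather than recomputed.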
Table 4: Reward probabilities in the Extended Prisoner’s Dilemma Game.

selection pattern   degree of charges   probability (persons 1, 2, 3)
(A, A, A)           (0, 0, 0)           0.55 0.55 0.55
(A, A, B)           (0, 0, 1)           0.73 0.73 0.40
(A, A, C)           (1, 0, 0)           0.40 0.73 0.73
(A, A, D)           (0, 1, 0)           0.73 0.40 0.73
(A, A, E)           (1, 1, 0)           0.40 0.40 0.73
(A, B, A)           (0, 1, 0)           0.73 0.40 0.73
(A, B, B)           (0, 1, 1)           0.73 0.40 0.40
(A, B, C)           (1, 1, 0)           0.40 0.40 0.73
(A, B, D)           (0, 2, 0)           0.76 0.30 0.76
(A, B, E)           (1, 2, 0)           0.73 0.30 0.76
(A, C, A)           (0, 0, 1)           0.73 0.73 0.40
(A, C, B)           (0, 0, 2)           0.76 0.76 0.30
(A, C, C)           (1, 0, 1)           0.40 0.73 0.40
(A, C, D)           (0, 1, 1)           0.73 0.40 0.40
(A, C, E)           (1, 1, 1)           0.60 0.60 0.60
(A, D, A)           (1, 0, 0)           0.40 0.73 0.73
(A, D, B)           (1, 0, 1)           0.40 0.73 0.40
(A, D, C)           (2, 0, 0)           0.30 0.76 0.76
(A, D, D)           (1, 1, 0)           0.40 0.40 0.73
(A, D, E)           (2, 1, 0)           0.30 0.73 0.76
(A, E, A)           (1, 0, 1)           0.40 0.73 0.40
(A, E, B)           (1, 0, 2)           0.73 0.76 0.30
(A, E, C)           (2, 0, 1)           0.30 0.76 0.73
(A, E, D)           (1, 1, 1)           0.60 0.60 0.60
(A, E, E)           (2, 1, 1)           0.70 0.70 0.70
(B, A, A)           (1, 0, 0)           0.40 0.73 0.73
(B, A, B)           (1, 0, 1)           0.40 0.73 0.40
(B, A, C)           (2, 0, 0)           0.30 0.76 0.76
(B, A, D)           (1, 1, 0)           0.40 0.40 0.73
(B, A, E)           (2, 1, 0)           0.30 0.73 0.76
(B, B, A)           (1, 1, 0)           0.40 0.40 0.73
(B, B, B)           (1, 1, 1)           0.60 0.60 0.60
(B, B, C)           (2, 1, 0)           0.30 0.73 0.76
(B, B, D)           (1, 2, 0)           0.73 0.30 0.76
(B, B, E)           (2, 2, 0)           0.30 0.30 0.76
(B, C, A)           (1, 0, 1)           0.40 0.73 0.40
Table 5: Reward probabilities in the Extended Prisoner’s Dilemma Game.

selection pattern   degree of charges   probability (persons 1, 2, 3)
(B, C, B)           (1, 0, 2)           0.73 0.76 0.30
(B, C, C)           (2, 0, 1)           0.30 0.76 0.73
(B, C, D)           (1, 1, 1)           0.60 0.60 0.60
(B, C, E)           (2, 1, 1)           0.70 0.70 0.70
(B, D, A)           (2, 0, 0)           0.30 0.76 0.76
(B, D, B)           (2, 0, 1)           0.30 0.76 0.73
(B, D, C)           (3, 0, 0)           0.20 0.79 0.79
(B, D, D)           (2, 1, 0)           0.30 0.73 0.76
(B, D, E)           (3, 1, 0)           0.20 0.76 0.79
(B, E, A)           (2, 0, 1)           0.30 0.76 0.73
(B, E, B)           (2, 0, 2)           0.30 0.76 0.30
(B, E, C)           (3, 0, 1)           0.20 0.79 0.76
(B, E, D)           (2, 1, 1)           0.70 0.70 0.70
(B, E, E)           (3, 1, 1)           0.30 0.76 0.76
(C, A, A)           (0, 1, 0)           0.73 0.40 0.73
(C, A, B)           (0, 1, 1)           0.73 0.40 0.40
(C, A, C)           (1, 1, 0)           0.40 0.40 0.73
(C, A, D)           (0, 2, 0)           0.76 0.30 0.76
(C, A, E)           (1, 2, 0)           0.73 0.30 0.76
(C, B, A)           (0, 2, 0)           0.76 0.30 0.76
(C, B, B)           (0, 2, 1)           0.76 0.30 0.73
(C, B, C)           (1, 2, 0)           0.73 0.30 0.76
(C, B, D)           (0, 3, 0)           0.79 0.20 0.79
(C, B, E)           (1, 3, 0)           0.76 0.20 0.79
(C, C, A)           (0, 1, 1)           0.73 0.40 0.40
(C, C, B)           (0, 1, 2)           0.76 0.73 0.30
(C, C, C)           (1, 1, 1)           0.60 0.60 0.60
(C, C, D)           (0, 2, 1)           0.76 0.30 0.73
(C, C, E)           (1, 2, 1)           0.70 0.70 0.70
(C, D, A)           (1, 1, 0)           0.40 0.40 0.73
(C, D, B)           (1, 1, 1)           0.60 0.60 0.60
(C, D, C)           (2, 1, 0)           0.30 0.73 0.76
(C, D, D)           (1, 2, 0)           0.73 0.30 0.76
(C, D, E)           (2, 2, 0)           0.30 0.30 0.76
(C, E, A)           (1, 1, 1)           0.60 0.60 0.60
(C, E, B)           (1, 1, 2)           0.70 0.70 0.70
Table 6: Reward probabilities in the Extended Prisoner’s Dilemma Game.

selection pattern   degree of charges   probability (persons 1, 2, 3)
(C, E, C)           (2, 1, 1)           0.70 0.70 0.70
(C, E, D)           (1, 2, 1)           0.70 0.70 0.70
(C, E, E)           (2, 2, 1)           0.40 0.40 0.73
(D, A, A)           (0, 0, 1)           0.73 0.73 0.40
(D, A, B)           (0, 0, 2)           0.76 0.76 0.30
(D, A, C)           (1, 0, 1)           0.40 0.73 0.40
(D, A, D)           (0, 1, 1)           0.73 0.40 0.40
(D, A, E)           (1, 1, 1)           0.60 0.60 0.60
(D, B, A)           (0, 1, 1)           0.73 0.40 0.40
(D, B, B)           (0, 1, 2)           0.76 0.73 0.30
(D, B, C)           (1, 1, 1)           0.60 0.60 0.60
(D, B, D)           (0, 2, 1)           0.76 0.30 0.73
(D, B, E)           (1, 2, 1)           0.70 0.70 0.70
(D, C, A)           (0, 0, 2)           0.76 0.76 0.30
(D, C, B)           (0, 0, 3)           0.79 0.79 0.20
(D, C, C)           (1, 0, 2)           0.73 0.76 0.30
(D, C, D)           (0, 1, 2)           0.76 0.73 0.30
(D, C, E)           (1, 1, 2)           0.70 0.70 0.70
(D, D, A)           (1, 0, 1)           0.40 0.73 0.40
(D, D, B)           (1, 0, 2)           0.73 0.76 0.30
(D, D, C)           (2, 0, 1)           0.30 0.76 0.73
(D, D, D)           (1, 1, 1)           0.60 0.60 0.60
(D, D, E)           (2, 1, 1)           0.70 0.70 0.70
(D, E, A)           (1, 0, 2)           0.73 0.76 0.30
(D, E, B)           (1, 0, 3)           0.76 0.79 0.20
(D, E, C)           (2, 0, 2)           0.30 0.76 0.30
(D, E, D)           (1, 1, 2)           0.70 0.70 0.70
(D, E, E)           (2, 1, 2)           0.40 0.73 0.40
(E, A, A)           (0, 1, 1)           0.73 0.40 0.40
(E, A, B)           (0, 1, 2)           0.76 0.73 0.30
(E, A, C)           (1, 1, 1)           0.60 0.60 0.60
(E, A, D)           (0, 2, 1)           0.76 0.30 0.73
(E, A, E)           (1, 2, 1)           0.70 0.70 0.70
(E, B, A)           (0, 2, 1)           0.76 0.30 0.73
(E, B, B)           (0, 2, 2)           0.76 0.30 0.30
(E, B, C)           (1, 2, 1)           0.70 0.70 0.70
Table 7: Reward probabilities in the Extended Prisoner’s Dilemma Game.

selection pattern   degree of charges   probability (persons 1, 2, 3)
(E, B, D)           (0, 3, 1)           0.79 0.20 0.76
(E, B, E)           (1, 3, 1)           0.76 0.30 0.76
(E, C, A)           (0, 1, 2)           0.76 0.73 0.30
(E, C, B)           (0, 1, 3)           0.79 0.76 0.20
(E, C, C)           (1, 1, 2)           0.70 0.70 0.70
(E, C, D)           (0, 2, 2)           0.76 0.30 0.30
(E, C, E)           (1, 2, 2)           0.73 0.40 0.40
(E, D, A)           (1, 1, 1)           0.60 0.60 0.60
(E, D, B)           (1, 1, 2)           0.70 0.70 0.70
(E, D, C)           (2, 1, 1)           0.70 0.70 0.70
(E, D, D)           (1, 2, 1)           0.70 0.70 0.70
(E, D, E)           (2, 2, 1)           0.40 0.40 0.73
(E, E, A)           (1, 1, 2)           0.70 0.70 0.70
(E, E, B)           (1, 1, 3)           0.76 0.76 0.30
(E, E, C)           (2, 1, 2)           0.40 0.73 0.40
(E, E, D)           (1, 2, 2)           0.73 0.40 0.40
(E, E, E)           (2, 2, 2)           0.50 0.50 0.50